red_wine <- read_csv("data/wine_quality_red.csv")
white_wine <- read_csv("data/wine_quality_white.csv")
Model the quality of the wine data to determine which
physiochemical properties make a wine ‘good’.
For the purposes of this exercise, I will create separate models for white win and red wine as I am interested in whether the propertise that make win ‘good’ are different depending on the type of wine.
Other interesting analysis could involve joining the two data sets to determine whether or not the type of wine effects that quality score.
glimpse(red_wine)
Rows: 1,599
Columns: 14
$ wine_id <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1…
$ fixed_acidity <dbl> 7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5, 6.7, 7.5,…
$ volatile_acidity <dbl> 0.700, 0.880, 0.760, 0.280, 0.700, 0.660, 0.600, 0.650, 0.58…
$ citric_acid <dbl> 0.00, 0.00, 0.04, 0.56, 0.00, 0.00, 0.06, 0.00, 0.02, 0.36, …
$ residual_sugar <dbl> 1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2.0, 6.1, 1.8, 6.1, …
$ chlorides <dbl> 0.076, 0.098, 0.092, 0.075, 0.076, 0.075, 0.069, 0.065, 0.07…
$ free_sulfur_dioxide <dbl> 11, 25, 15, 17, 11, 13, 15, 15, 9, 17, 15, 17, 16, 9, 52, 51…
$ total_sulfur_dioxide <dbl> 34, 67, 54, 60, 34, 40, 59, 21, 18, 102, 65, 102, 59, 29, 14…
$ density <dbl> 0.9978, 0.9968, 0.9970, 0.9980, 0.9978, 0.9978, 0.9964, 0.99…
$ p_h <dbl> 3.51, 3.20, 3.26, 3.16, 3.51, 3.51, 3.30, 3.39, 3.36, 3.35, …
$ sulphates <dbl> 0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.80, …
$ alcohol <dbl> 9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10.0, 9.5, 10.5, 9.2, 10.…
$ quality <dbl> 4.66, 4.82, 4.91, 5.96, 5.42, 4.78, 5.09, 6.78, 7.10, 4.78, …
$ region <chr> "Australia", "Spain", "Spain", "Australia", "Italy", "Italy"…
alias(quality ~ ., data = red_wine)
Model :
quality ~ wine_id + fixed_acidity + volatile_acidity + citric_acid +
residual_sugar + chlorides + free_sulfur_dioxide + total_sulfur_dioxide +
density + p_h + sulphates + alcohol + region
wine_id.alias()p_h a factor as it
does have a fixed range of 0-14. From research online:“there are two scales that can be used to measure how acidic wine is, which include pH and total acidity, the latter of which is measured in grams per liter. The pH measurement aims to identify the total concentration of hydrogen ions that are present in the solution. (https://sensorex.com/ph-wine-making)”
red_wine_clean <- red_wine %>%
select(-wine_id) %>%
mutate(region = as.factor(region))
To make it easier to initially explore potential predictor candidates, I have seperate the variables into 4 groups based on the context provided by the data dictionary.
Group 1 / Acidity: fixed_acidity, volatile_acidity, citric_acid, p_h Group 2 / Sulfur Dioxide: free_sulfur_dioxide, total_sulfur_dioxide, sulphates Group 3 / Other Physiochemical Properties: residual_sugar, chlorides, density, alcohol Group 4 / Location: region
fixed_acidity, volatile_acidity, citric_acid, p_h
red_wine_clean %>%
select(quality, fixed_acidity, volatile_acidity, citric_acid, p_h) %>%
ggpairs(progress = FALSE)
volatitle_acidity which has a weak negative
correlation (-0.364) that is statistically significant.free_sulfur_dioxide, total_sulfur_dioxide, sulphates
red_wine_clean %>%
select(quality, free_sulfur_dioxide, total_sulfur_dioxide, sulphates) %>%
ggpairs(progress = FALSE)
sulphates which has a weak positive
correlation (0.230) that is statistically significant.residual_sugar, chlorides, density, alcohol
red_wine_clean %>%
select(quality, residual_sugar, chlorides, alcohol) %>%
ggpairs(progress = FALSE)
alcohol has a moderate poisitve correlation (0.454)
with quality that is statistically significant.region
red_wine_clean %>%
select(quality, region) %>%
ggplot() +
geom_boxplot(aes(x = quality, y = region))
NA
region does not appear to be strongly correlated to
quality.After exploration, the following look like the best initial candidates:
alcohol has a moderate positive correlation (0.454)
volatile_acidity which has a weak negative correlation
(-0.364) sulphates which has a weak positive correlation
(0.230)
I will now check for any correlation between these three candidates.
red_wine_clean %>%
select(alcohol, volatile_acidity, sulphates) %>%
ggpairs(progress = FALSE)
No evidence of strong correlation between these candidates, so for ease of working with I will now add these to a new tibble.
red_wine_trim <- red_wine_clean %>%
select(quality, alcohol, volatile_acidity, sulphates)
model_1a <- lm(quality ~ alcohol,
data = red_wine_trim)
model_1b <- lm(quality ~ volatile_acidity,
data = red_wine_trim)
model_1c <- lm(quality ~ sulphates,
data = red_wine_trim)
autoplot(model_1a)
Plot 1: No strong evidence of a pattern. Plot 2: Distribution of standardised residuals appears fairly normal. Plot 3: No strong evidence of funneling.
I will consider these diagnostics as acceptable.
autoplot(model_1b)
Plot 1: No strong evidence of a pattern. Plot 2: Distribution of standardised residuals appears fairly normal. Plot 3: No strong evidence of funneling.
I will consider these diagnostics as acceptable.
autoplot(model_1c)
Plot 1: Some evidence of pattern. Plot 2: Distribution of standardised residuals appears fairly normal. Plot 3:Some evidence of pattern.
I will consider these diagnostics as not-acceptable.
As model_1c was deemed to have failed diagnostics, I
will only check the summaries for model_1a and
model_1b.
summary(model_1a)
Call:
lm(formula = quality ~ alcohol, data = red_wine_trim)
Residuals:
Min 1Q Median 3Q Max
-2.95517 -0.50166 -0.04501 0.49519 2.70011
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.89570 0.18603 10.19 <2e-16 ***
alcohol 0.36142 0.01776 20.36 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.7564 on 1597 degrees of freedom
Multiple R-squared: 0.206, Adjusted R-squared: 0.2055
F-statistic: 414.4 on 1 and 1597 DF, p-value: < 2.2e-16
Multiple R-Squared: 0.206 P-Value: <2e-16 *** Residual Standard Error: 0.7564
summary(model_1b)
Call:
lm(formula = quality ~ volatile_acidity, data = red_wine_trim)
Residuals:
Min 1Q Median 3Q Max
-2.74440 -0.54813 -0.02734 0.53280 2.77353
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.57413 0.06155 106.81 <2e-16 ***
volatile_acidity -1.72666 0.11044 -15.63 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.7905 on 1597 degrees of freedom
Multiple R-squared: 0.1327, Adjusted R-squared: 0.1322
F-statistic: 244.4 on 1 and 1597 DF, p-value: < 2.2e-16
Multiple R-Squared: 0.1327 P-Value: <2e-16 *** Residual Standard Error: 0.7905
The effect of both alcohol and
volatile_acidity on quality was found to be
statistically significant. As alcohol was found to have the
greater Multiple R-Squared value, this will be the first predictor I add
to the model.
From the below volatile_acidity appears to have the
highest correlation, which is also statistically significant with the
residuals. This will therefor be the second predictor that I add to the
model.
red_win_resid <- red_wine_trim %>%
add_residuals(model_1a) %>%
select(-quality, -alcohol)
red_win_resid %>%
ggpairs(progress = FALSE)
model_2a <- lm(quality ~ alcohol + volatile_acidity,
data = red_wine_trim)
autoplot(model_2a)
Plot 1: No strong evidence of a pattern. Plot 2: Distribution of standardised residuals appears fairly normal. Plot 3: No strong evidence of funneling.
I will consider these diagnostics as acceptable.
summary(model_2a)
Call:
lm(formula = quality ~ alcohol + volatile_acidity, data = red_wine_trim)
Residuals:
Min 1Q Median 3Q Max
-2.55192 -0.47030 -0.01778 0.47927 2.41435
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.08357 0.19857 15.53 <2e-16 ***
alcohol 0.31565 0.01723 18.32 <2e-16 ***
volatile_acidity -1.34665 0.10254 -13.13 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.7188 on 1596 degrees of freedom
Multiple R-squared: 0.2834, Adjusted R-squared: 0.2825
F-statistic: 315.7 on 2 and 1596 DF, p-value: < 2.2e-16
Multiple R-Squared: 0.2834 alcohol P-Value: <2e-16 volatile_acidity P-Value: <2e-16 Residual Standard Error: 0.7188
The effect of both alcohol and the newly added predictor
volatile_acidity on quality was found to be
statistically significant. The R-squared value has increase FROM 0.206
TO 0.2834.
red_win_resid <- red_wine_trim %>%
add_residuals(model_2a) %>%
select(-quality, -alcohol, -volatile_acidity)
red_win_resid %>%
ggpairs(progress = FALSE)
The correlation between sulphates and the residuals is
very weak although it is indicated as being statistically significant.
If the model already contained a large number of predictors I may not
add to it’s complexity by adding sulphates based on it’s
weak correlation; however, as the model only has 2 predictors so far I
will try adding sulphates to the model.
I will then use a K-fold cross validate to compare the model with
sulphates to the model without sulphates.
model_3a <- lm(quality ~ alcohol + volatile_acidity + sulphates,
data = red_wine_trim)
autoplot(model_3a)
Plot 1: No strong evidence of a pattern. Plot 2: Distribution of standardised residuals appears fairly normal. Plot 3: No strong evidence of funneling.
I will consider these diagnostics as acceptable.
summary(model_3a)
Call:
lm(formula = quality ~ alcohol + volatile_acidity + sulphates,
data = red_wine_trim)
Residuals:
Min 1Q Median 3Q Max
-2.66345 -0.45393 -0.01326 0.48807 2.16111
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.62881 0.21133 12.439 < 2e-16 ***
alcohol 0.31134 0.01707 18.241 < 2e-16 ***
volatile_acidity -1.19442 0.10476 -11.401 < 2e-16 ***
sulphates 0.63716 0.10886 5.853 5.84e-09 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.7114 on 1595 degrees of freedom
Multiple R-squared: 0.2985, Adjusted R-squared: 0.2972
F-statistic: 226.2 on 3 and 1595 DF, p-value: < 2.2e-16
Multiple R-Squared: 0.2834 alcohol P-Value: <2e-16
volatile_acidity P-Value: <2e-16
sulphates P-Value: 5.84e-09 *** Residual
Standard Error: 0.7114
The effect of all three predictors appears to be statistically
significant; however, there was only a small resultant increase in
R-Squared when adding sulphates with this increasing FROM
0.2834 TO 0.2985.
cv_10_fold <- trainControl(
method = "cv",
number = 10,
savePredictions = TRUE
)
# Set Up for Model 2a
model_1 <- train(
quality ~ alcohol + volatile_acidity, # model formula
data = red_wine_trim,
trControl = cv_10_fold, # pass the model training specs
method = "lm"
)
# Set Up for Model 3a
model_2 <- train(
quality ~ quality ~ alcohol + volatile_acidity + sulphates, # model formula
data = red_wine_trim,
trControl = cv_10_fold, # pass the model training specs
method = "lm"
)
Error in model.frame.default(form = quality ~ quality ~ alcohol + volatile_acidity + :
object is not a matrix
To compare the models we will look at the Root Mean Standard Error AND the mean R-Squared Value.
mean(model_1$resample$RMSE)
[1] 0.7182499
mean(model_1$resample$Rsquared)
[1] 0.2835953
mean(model_2$resample$RMSE)
[1] 0.7113169
mean(model_2$resample$Rsquared)
[1] 0.3002908
anova(model_2a, model_3a)
Analysis of Variance Table
Model 1: quality ~ alcohol + volatile_acidity
Model 2: quality ~ alcohol + volatile_acidity + sulphates
Res.Df RSS Df Sum of Sq F Pr(>F)
1 1596 824.54
2 1595 807.20 1 17.338 34.26 5.842e-09 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The anova has returned a very small p-value, from which we can
interpret that adding sulphates to the model did make a
statistically significant impact on the fit of the model.
Model: quality ~ alcohol + volatile_acidity Root Mean Standard Error: 0.7182499 R-Squared: 0.2835953
Model: quality ~ alcohol + volatile_acidity + sulphates Root Mean Standard Error: 0.7113169 R-Squared: 0.3002908
From the k-fold cross validation results, we can see that adding
sulphates to the model resulted in a decrease in Root Mean
Standard Error and an increase in R-Squared. From the Anova, we can see
that add sulpahates to the model made a statistically
significant impact on the fit of the model. Therefor, I would conclude
that the below model is the best fit:
quality ~ alcohol + volatile_acidity + sulphates
Final Model: quality ~ alcohol + volatile_acidity + sulphates
Interpretation: ~30% of the variation in the quality score assigned to wine can be explained by a combination of the percentage of alcohol, the amount of acetic acid in wine and the amount of sulphates.
Example: For a wine with 10% alcohol, 0.5 g / dm^3 acetic acid and 0.6 g / dm3 potassium sulphate, the predicted quality score would be 5.527296 (Low Estimate 4.815979, High Estimate 6.238613).
As calculated below.
Step 1. outcome = b0 + b1 * 1st Predictor + b2 * 2nd Predictor + b3 * 3rd Predictor
Step 2. quality = b0(intercept) + b1(coefficient) * 1st Predictor + b2(coefficient) * 2nd Predictor
Step 3. quality = b0(intercept) + b1(coefficient) *
alcohol + b2(coefficient) * volatile_acidity +
b3(coefficient) * sulphates
Step 4. quality = 2.62881 + 0.31134 * 10 + -1.19442 *
0.5 + 0.63716 * 0.6
Step 5. quality = 5.527296 (Low Estimate 4.815979, High Estimate 6.238613)
# Taking quality +- the RMSE, we can say quality within +- bracket.
5.527296 + c(-0.7113169, 0.7113169)
[1] 4.815979 6.238613
glimpse(white_wine)
Rows: 4,898
Columns: 14
$ wine_id <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 1…
$ fixed_acidity <dbl> 7.0, 6.3, 8.1, 7.2, 7.2, 8.1, 6.2, 7.0, 6.3, 8.1, 8.1, 8.6, 7.9,…
$ volatile_acidity <dbl> 0.27, 0.30, 0.28, 0.23, 0.23, 0.28, 0.32, 0.27, 0.30, 0.22, 0.27…
$ citric_acid <dbl> 0.36, 0.34, 0.40, 0.32, 0.32, 0.40, 0.16, 0.36, 0.34, 0.43, 0.41…
$ residual_sugar <dbl> 20.70, 1.60, 6.90, 8.50, 8.50, 6.90, 7.00, 20.70, 1.60, 1.50, 1.…
$ chlorides <dbl> 0.045, 0.049, 0.050, 0.058, 0.058, 0.050, 0.045, 0.045, 0.049, 0…
$ free_sulfur_dioxide <dbl> 45, 14, 30, 47, 47, 30, 30, 45, 14, 28, 11, 17, 16, 48, 41, 28, …
$ total_sulfur_dioxide <dbl> 170, 132, 97, 186, 186, 97, 136, 170, 132, 129, 63, 109, 75, 143…
$ density <dbl> 1.0010, 0.9940, 0.9951, 0.9956, 0.9956, 0.9951, 0.9949, 1.0010, …
$ p_h <dbl> 3.00, 3.30, 3.26, 3.19, 3.19, 3.26, 3.18, 3.00, 3.30, 3.22, 2.99…
$ sulphates <dbl> 0.45, 0.49, 0.44, 0.40, 0.40, 0.44, 0.47, 0.45, 0.49, 0.45, 0.56…
$ alcohol <dbl> 8.8, 9.5, 10.1, 9.9, 9.9, 10.1, 9.6, 8.8, 9.5, 11.0, 12.0, 9.7, …
$ quality <dbl> 5.92, 6.06, 6.25, 6.38, 6.35, 5.82, 5.53, 6.17, 6.08, 5.86, 4.91…
$ region <chr> "Australia", "Australia", "Spain", "Spain", "USA", "Italy", "Aus…
alias(quality ~ ., data = red_wine)
summary(white_wine)
white_wine %>%
skim() %>%
view()
wine_id.alias()p_h a factor as it
does have a fixed range of 0-14. From research online:“there are two scales that can be used to measure how acidic wine is, which include pH and total acidity, the latter of which is measured in grams per liter. The pH measurement aims to identify the total concentration of hydrogen ions that are present in the solution. (https://sensorex.com/ph-wine-making)”
white_wine_clean <- white_wine %>%
select(-wine_id) %>%
mutate(region = as.factor(region))
To make it easier to initially explore potential predictor candidates, I will seperate the variables into the same groups as used for red wine.
Group 1 / Acidity: fixed_acidity, volatile_acidity, citric_acid, p_h Group 2 / Sulfur Dioxide: free_sulfur_dioxide, total_sulfur_dioxide, sulphates Group 3 / Other Physiochemical Properties: residual_sugar, chlorides, density, alcohol Group 4 / Location: region
fixed_acidity, volatile_acidity, citric_acid, p_h
white_wine_clean %>%
select(quality, fixed_acidity, volatile_acidity, citric_acid, p_h) %>%
ggpairs(progress = FALSE)
volatitle_acidity which has a very weak
negative correlation (-0.190) that is statistically significant.free_sulfur_dioxide, total_sulfur_dioxide, sulphates
white_wine_clean %>%
select(quality, free_sulfur_dioxide, total_sulfur_dioxide, sulphates) %>%
ggpairs(progress = FALSE)
total_sulfur_dioxide which has a very weak
negative correlation (-0.176) that is statistically significant.residual_sugar, chlorides, density, alcohol
white_wine_clean %>%
select(quality, residual_sugar, chlorides, alcohol) %>%
ggpairs(progress = FALSE)
alcohol has a moderate positive correlation (0.421)
with quality that is statistically significant.
chlorides have a week negative correlation (-0.203)
with quality that is statistically significant.
region
white_wine_clean %>%
select(quality, region) %>%
ggplot() +
geom_boxplot(aes(x = quality, y = region))
NA
region does not appear to be strongly correlated to
quality.After exploration, the following look like the best initial candidates:
alcohol has a moderate positive correlation (0.421)
chlorides has a week negative correlation (-0.203)
volatitle_acidity has a very weak negative correlation
(-0.190) total_sulfur_dioxide which has a very weak
negative correlation (-0.176)
At this stage, I note that alcohol and
volatile_acidity were included in the red wine model,
although chlorides and total_sulfur_dioxide
were note. Similar to red wine however, only alcohol had a
correlation strength of moderate or higher.
I will now check for any correlation between thee four potential candidates.
white_wine_clean %>%
select(alcohol, chlorides, volatile_acidity, total_sulfur_dioxide) %>%
ggpairs(progress = FALSE)
alcohol and total_sulfur_dioxide appear to
have a moderately negative correlation (-0.449).alcohol and chlorides appear to have a
weak negative correlation (-0.360).I will keep this in mind when building my model, but for now I will retain all four potential predcitors as there are no strong correlations. For ease of working with I will now add these to a new tibble.
white_wine_trim <- white_wine_clean %>%
select(quality, alcohol, chlorides, volatile_acidity, total_sulfur_dioxide)
model_1a_w <- lm(quality ~ alcohol,
data = white_wine_trim)
model_1b_w <- lm(quality ~ chlorides,
data = white_wine_trim)
model_1c_w <- lm(quality ~ volatile_acidity,
data = white_wine_trim)
model_1d_w <- lm(quality ~ total_sulfur_dioxide,
data = white_wine_trim)
autoplot(model_1a_w)
Plot 1: No strong evidence of a pattern. Plot 2: Distribution of standardised residuals appears fairly normal. Plot 3: No strong evidence of funneling.
I will consider these diagnostics as acceptable.
autoplot(model_1b_w)
Plot 1: Some evidence of a pattern, with residuals for lower fitted values having a tendancy to be mainly positive wheras residuals for higher fitted values appear evenly distributed between positive and negative. Plot 2: Distribution of standardised residuals appears fairly normal. Plot 3: No strong evidence of funneling.
Plot 1 tests for independence with plot 3 looking for a consistency
of variation. It may be that the diagnostics are reflecting the
correlation between chlorides and alcohol.
I will consider these diagnostics not acceptable and will not consider summary data for this model.
autoplot(model_1c_w)
Plot 1: No strong evidence of a pattern. Plot 2: Distribution of standardised residuals appears fairly normal. Plot 3: No strong evidence of funneling.
I will consider these diagnostics as acceptable.
autoplot(model_1d_w)
Plot 1: Some evidence of a pattern. Plot 2: Distribution of standardised residuals appears fairly normal. Plot 3: No strong evidence of funneling.
I will consider these diagnostics as acceptable.
As model_1b_w was deemed to have failed diagnostics, I
will only check the summaries for model_1a_w,
model_1c_w and model_1d_w
summary(model_1a_w)
Call:
lm(formula = quality ~ alcohol, data = white_wine_trim)
Residuals:
Min 1Q Median 3Q Max
-3.5247 -0.5623 -0.0207 0.5432 3.2832
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.523661 0.104913 24.05 <2e-16 ***
alcohol 0.321456 0.009911 32.44 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.8535 on 4896 degrees of freedom
Multiple R-squared: 0.1769, Adjusted R-squared: 0.1767
F-statistic: 1052 on 1 and 4896 DF, p-value: < 2.2e-16
Multiple R-Squared: 0.1769 P-Value: <2e-16 *** Residual Standard Error: 0.8535
summary(model_1c_w)
Call:
lm(formula = quality ~ volatile_acidity, data = white_wine_trim)
Residuals:
Min 1Q Median 3Q Max
-3.4823 -0.6262 -0.0278 0.5396 3.5041
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.39707 0.03875 165.09 <2e-16 ***
volatile_acidity -1.77377 0.13094 -13.55 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.9236 on 4896 degrees of freedom
Multiple R-squared: 0.03613, Adjusted R-squared: 0.03593
F-statistic: 183.5 on 1 and 4896 DF, p-value: < 2.2e-16
Multiple R-Squared: 0.03613 P-Value: <2e-16 *** Residual Standard Error: 0.9236
summary(model_1d_w)
Call:
lm(formula = quality ~ total_sulfur_dioxide, data = white_wine_trim)
Residuals:
Min 1Q Median 3Q Max
-3.3205 -0.6299 0.0092 0.5551 3.4379
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.4415046 0.0450711 142.92 <2e-16 ***
total_sulfur_dioxide -0.0038882 0.0003114 -12.49 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.9261 on 4896 degrees of freedom
Multiple R-squared: 0.03086, Adjusted R-squared: 0.03066
F-statistic: 155.9 on 1 and 4896 DF, p-value: < 2.2e-16
Multiple R-Squared: 0.03086 P-Value: <2e-16 *** Residual Standard Error: 0.9261
The effect of both alcohol,
volatile_acidity and total_sulfur_dioxide on
quality was found to be statistically significant. As
alcohol was found to have the greater Multiple R-Squared
value, this will be the first predictor I add to the model.
From the below total_sulfur_dioxide appears to have the
highest correlation, which is also statistically significant with the
residuals. This will therefor be the second predictor that I add to the
model.
white_win_resid <- white_wine_trim %>%
add_residuals(model_1a_w) %>%
select(-quality, -alcohol)
model_2a_w <- lm(quality ~ alcohol + total_sulfur_dioxide,
data = white_wine_trim)
autoplot(model_2a_w)
Plot 1: No strong evidence of a pattern. Plot 2: Distribution of standardised residuals appears fairly normal. Plot 3: No strong evidence of funneling.
I will consider these diagnostics as acceptable.
summary(model_2a_w)
Call:
lm(formula = quality ~ alcohol + total_sulfur_dioxide, data = white_wine_trim)
Residuals:
Min 1Q Median 3Q Max
-3.5207 -0.5644 -0.0201 0.5445 3.2891
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.4141001 0.1427312 16.914 <2e-16 ***
alcohol 0.3270918 0.0110904 29.493 <2e-16 ***
total_sulfur_dioxide 0.0003636 0.0003211 1.132 0.258
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.8534 on 4895 degrees of freedom
Multiple R-squared: 0.1771, Adjusted R-squared: 0.1768
F-statistic: 526.7 on 2 and 4895 DF, p-value: < 2.2e-16
Multiple R-Squared: 0.1771 alcohol P-Value: <2e-16 *** total_sulfur_dioxide P-Value: 0.258 Residual Standard Error: 0.8534
Multiple R-Squared: 0.1769
The effect of alcohol remains statistically significant;
however, the effect of total_sulfur_dioxide is not
statistically significant. In addition, the R-Squared value has only
minimally increased from ‘0.1769’ to ‘0.1771’.
With this in mind, I will not proceed in adding
total_sulfur_dioxide to the model. In addition, based on
the week correlation between the other remaining potential predcitors
and residuals, I do not believe it would be of benefit to add any other
predictors to this model based on the current data.
Final Model: quality ~ alcohol
Interpretation: ~18% of the variation in the quality score assigned to wine can be explained by the the percentage of alcohol.
Example: For a wine with 10.5% alcohol the predicted quality score would be 5.898949 (Low Estimate 5.045449, High Estimate 6.752449).
As calculated below.
Step 1. outcome = b0 + b1 * 1st Predictor
Step 2. quality = b0(intercept) + b1(coefficient) * 1st Predictor
Step 3. quality = b0(intercept) + b1(coefficient) *
alcohol
Step 4. quality = 2.523661 + 0.321456 * 10.5
Step 5. quality = 5.898949 (Low Estimate 5.045449, High Estimate 6.752449)
# Taking quality +- the Residual Standard Error, we can say quality within +- bracket.
5.898949 + c(-0.8535, 0.8535)
[1] 5.045449 6.752449
Automated model building using forward selection will be used to add further insight and for the purposes of comparison against the manualy built models.
Using forward selection, and limiting the model to 3 predictors (the same as our model), the following model was selected.
quality ~ alcohol + volatile_acidity +
sulphates
This is the same as the model found from manual model building.
reg_forward_red <- regsubsets(quality ~ .,
data = red_wine_clean,
nvmax = 3,
method = "forward")
summary(reg_forward_red)
Subset selection object
Call: regsubsets.formula(quality ~ ., data = red_wine_clean, nvmax = 3,
method = "forward")
15 Variables (and intercept)
Forced in Forced out
fixed_acidity FALSE FALSE
volatile_acidity FALSE FALSE
citric_acid FALSE FALSE
residual_sugar FALSE FALSE
chlorides FALSE FALSE
free_sulfur_dioxide FALSE FALSE
total_sulfur_dioxide FALSE FALSE
density FALSE FALSE
p_h FALSE FALSE
sulphates FALSE FALSE
alcohol FALSE FALSE
regionFrance FALSE FALSE
regionItaly FALSE FALSE
regionSpain FALSE FALSE
regionUSA FALSE FALSE
1 subsets of each size up to 3
Selection Algorithm: forward
fixed_acidity volatile_acidity citric_acid residual_sugar chlorides
1 ( 1 ) " " " " " " " " " "
2 ( 1 ) " " "*" " " " " " "
3 ( 1 ) " " "*" " " " " " "
free_sulfur_dioxide total_sulfur_dioxide density p_h sulphates alcohol regionFrance
1 ( 1 ) " " " " " " " " " " "*" " "
2 ( 1 ) " " " " " " " " " " "*" " "
3 ( 1 ) " " " " " " " " "*" "*" " "
regionItaly regionSpain regionUSA
1 ( 1 ) " " " " " "
2 ( 1 ) " " " " " "
3 ( 1 ) " " " " " "
Using forward selection, and limiting the model to 3 predictors (the same the red wine model), automated model suggests that the following model would provide the highest adjusted R-Squared.
quality ~ alcohol + volatile_acidity + sulphates
However, when considering the BIC which penalises models for
oxercomplexity and where lower is best, there is a notable increase
between the use of alcohol as a single predictor and the
addition of further predictors. This would support the model chosen
through manual model building.
reg_forward_white <- regsubsets(quality ~ .,
data = red_wine_clean,
nvmax = 3,
method = "forward")
summary(reg_forward_white)
Subset selection object
Call: regsubsets.formula(quality ~ ., data = red_wine_clean, nvmax = 3,
method = "forward")
15 Variables (and intercept)
Forced in Forced out
fixed_acidity FALSE FALSE
volatile_acidity FALSE FALSE
citric_acid FALSE FALSE
residual_sugar FALSE FALSE
chlorides FALSE FALSE
free_sulfur_dioxide FALSE FALSE
total_sulfur_dioxide FALSE FALSE
density FALSE FALSE
p_h FALSE FALSE
sulphates FALSE FALSE
alcohol FALSE FALSE
regionFrance FALSE FALSE
regionItaly FALSE FALSE
regionSpain FALSE FALSE
regionUSA FALSE FALSE
1 subsets of each size up to 3
Selection Algorithm: forward
fixed_acidity volatile_acidity citric_acid residual_sugar chlorides
1 ( 1 ) " " " " " " " " " "
2 ( 1 ) " " "*" " " " " " "
3 ( 1 ) " " "*" " " " " " "
free_sulfur_dioxide total_sulfur_dioxide density p_h sulphates alcohol regionFrance
1 ( 1 ) " " " " " " " " " " "*" " "
2 ( 1 ) " " " " " " " " " " "*" " "
3 ( 1 ) " " " " " " " " "*" "*" " "
regionItaly regionSpain regionUSA
1 ( 1 ) " " " " " "
2 ( 1 ) " " " " " "
3 ( 1 ) " " " " " "
plot(reg_forward_white, scale = "adjr2")
plot(reg_forward_white, scale = "bic")
For both red and white wine, the percentage of alcohol is the main factor which explains the variation in quality scores assigned to a whine.
For red wine, the percentage of alcohol together with the amount of acetic acid and sulfates, explains ~30% of the variation in the quality score assigned.
For white wine, the percentage of alcohol explains ~18% of variation in the quality score assigned. No other variable was deemed to significantly benefit the model.